Abstract

Part I of this paper investigates the differences — conceptually and algorithmically — between affine and 
projective frameworks for the tasks of visual recognition and reconstruction from perspective views. It 
is shown that an affine invariant exists between any view and a fixed view chosen as a reference view. 
This implies that for tasks for which a reference view can be chosen, such as in alignment schemes for 
visual recognition, projective invariants are not really necessary. The projective extension is then derived, 
showing that it is necessary only for tasks for which a reference view is not available — such as happens 
when updating scene structure from a moving stereo rig. The geometric difference between the two proposed 
invariants is that the affine invariant measures the relative deviation from a single reference plane, whereas 
the projective invariant measures the relative deviation from two reference planes. The affine invariant can 
be computed from three corresponding points and a fourth point for setting a scale; the projective invariant 
can be computed from four corresponding points and a fifth point for setting a scale. Both the affine and 
projective invariants are shown to be recovered by remarkably simple and linear methods. 
In part II we use the affine invariant to derive new algebraic connections between perspective views. It 
is shown that three perspective views of an object are connected by certain algebraic functions of image 
coordinates alone (no structure or camera geometry needs to be involved). In the general case, three views 
satisfy a trilinear function of image coordinates. In the case where two of the views are orthographic and the 
third is perspective, the function reduces to a bilinear form. In the case where all three views are orthographic, the 
function reduces further to a linear form (the "linear combination of views" of [31]). These functions are 
shown to be useful for recognition, among other applications. 
1 Introduction

The geometric relation between objects (or scenes) in 
the world and their images, taken from different viewing 
positions by a pin-hole camera, has many subtleties and 
nuances and has been the subject of research in computer 
vision since its early days. Two major areas in computer 
vision have been shown to benefit from an analytic treat- 
ment of the 3D to 2D geometry: visual recognition and 
reconstruction from multiple views (as a result of having 
motion sequences or from stereopsis). 

A recent approach with growing interest in the past 
few years is based on the idea that non-metric informa- 
tion, although weaker than the information provided by 
depth maps and rigid camera geometries, is nonetheless 
useful in the sense that the framework may provide sim- 
pler algorithms, camera calibration is not required, more 
freedom in picture-taking is allowed — such as taking 
pictures of pictures of objects, and there is no need to 
make a distinction between orthographic and perspective 
projections. The list of contributions to this framework
includes (though it is not intended to be complete) [14, 26,
33, 34, 9, 20, 3, 4, 28, 29, 19, 31, 23, 5, 6, 18, 27, 13, 12];
most relevant to this paper is the work described in
[14, 4, 26, 28, 29].

This paper has two parts. In Part I we investi- 
gate the intrinsic differences — conceptually and algo- 
rithmically — between an affine framework for recog- 
nition/reconstruction and a projective framework. Al- 
though the distinction between affine and projective 
spaces, and between affine and projective properties, is 
perfectly clear from classic studies in projective and alge- 
braic geometries, as can be found in [8, 24, 25], it is less 
clear how these concepts relate to reconstruction from 
multiple views. In other words, given a set of views, under
what conditions can we expect to recover affine invariants?
What is the benefit of recovering projective invariants over
affine ones? Are there tasks, or methodologies, for which an
affine framework is completely sufficient? What are the
relations between the set of views generated by a pin-hole
camera and the set of all possible projections
$\mathcal{P}^3 \mapsto \mathcal{P}^2$ of a particular object? These are the
kinds of questions for which the current literature does 
not provide satisfactory answers. For example, there is a 
tendency in some of the work listed above, following the 
influential work of [14], to associate the affine framework 
with reconstruction/recognition from orthographic views 
only. As will be shown later, the affine restriction need 
not be coupled with the orthographic restriction on the 
model of projection — provided we set one view fixed. In 
other words, an uncalibrated pin-hole camera undergo- 
ing general motion can indeed be modeled as an "affine 
engine" provided we introduce a "reference view", i.e., 
all other views are matched against the reference view 
for recovering invariants or for achieving recognition. 

In the course of addressing these issues we derive two 
new, extremely simple, schemes for recovering geometric 
invariants — one affine and the other projective — which 
can be used for recognition and for reconstruction. 

Some of the ideas presented in this part of the paper
follow the work of [14, 4, 26, 28, 29]. Section 3, on
affine reconstruction from two perspective views, follows
and expands upon the work of [26, 14, 4]. Section 4, on
projective reconstruction, follows and refines the results
presented in [28, 29].

In Part II of this paper we use the results established 
in Part I (specifically those in Section 3) to address certain
algebraic aspects of the connections between multiple
views. Inspired by the work of [31], we address
the problem of establishing a direct connection between
views, expressed as functions of image coordinates alone,
which we call "algebraic functions of views". In addition
to the linear functions of views discovered by [31], which are
applicable to orthographic views only, we show that three
perspective views are related by trilinear functions of
their coordinates, and by bilinear functions if two of the
three views are assumed orthographic. The latter case, as will
be argued, is relevant for purposes of recognition without
constraining the generality of the recognition process.
Part II ends with a discussion of possible applications 
for algebraic functions, other than visual recognition. 

2 Mathematical Notations and 
Preliminaries 

We consider object space to be the three-dimensional
projective space $\mathcal{P}^3$, and image space to be the
two-dimensional projective space $\mathcal{P}^2$. Within $\mathcal{P}^3$ we will be
considering the projective group of transformations and
the affine group. Below we describe basic definitions and
formalism related to projective and affine geometries;
more details can be found in [8, 24, 25].

2.1 Affine and Projective Spaces 

Affine space over the field $K$ is simply the vector space
$K^n$, and is usually denoted as $A^n$. Projective space $\mathcal{P}^n$
is the set of equivalence classes over the vector space
$K^{n+1}$. A point in $\mathcal{P}^n$ is usually written as a homogeneous
vector $(x_0, \ldots, x_n)$, which is an ordered set of $n + 1$
real or complex numbers, not all zero, whose ratios only
are to be regarded as significant. Two points $x$ and $y$
are equivalent, denoted by $x \cong y$, if $x = \lambda y$ for some
scalar $\lambda$. Likewise, two points are distinct if there is no
such scalar.

2.2 Representations 

The points in $\mathcal{P}^n$ admit a class of coordinate representations
$\mathcal{R}$ such that if $\mathcal{R}_0$ is any one allowable representation,
the whole class $\mathcal{R}$ consists of all those representations
that can be obtained from $\mathcal{R}_0$ by the action
of the group $GL_{n+1}$ of $(n + 1) \times (n + 1)$ non-singular
matrices. It follows that any one coordinate
representation is completely specified by its standard
simplex and its unit point. The standard simplex is
the set of $n + 1$ points which have the standard coordinates
$(1, 0, \ldots, 0), (0, 1, 0, \ldots, 0), \ldots, (0, 0, \ldots, 0, 1)$, and the
unit point is the point whose coordinates are $(1, 1, \ldots, 1)$.
It also follows that the coordinate transformation between
any two representations is completely determined
from $n + 2$ corresponding points in the two representations
(the images of the standard simplex and of the unit point),
which give rise to a linear system of $(n + 1)^2 - 1$
or $(n + 1)^2$ equations (depending on whether we set an
arbitrary element of the matrix transform, or set one of
the scale factors of the corresponding points).



2.3 Subspaces and Cross Ratios 

A linear subspace $\Lambda = \mathcal{P}^k \subset \mathcal{P}^n$ is a hyperplane if $k = n - 1$,
is a line when $k = 1$, and otherwise is a $k$-plane.
There is a unique line in $\mathcal{P}^n$ through any two distinct
points. Any point $z$ on a line can be described as a linear
combination of two fixed points $x, y$ on the line, i.e.,
$z = x + ky$. Let $v = x + k'y$ be another point on the line
spanned by $x, y$; then the cross ratio of the four points is
simply $\alpha = k/k'$, which is invariant in all representations
$\mathcal{R}$. By permuting the four points on the line, the 24
possible cross ratios fall into six sets of four with values
$\alpha$, $1/\alpha$, $1 - \alpha$, $(\alpha - 1)/\alpha$, $\alpha/(\alpha - 1)$ and $1/(1 - \alpha)$.
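
The permutation statement can be checked numerically. The following is a minimal sketch, not part of the original derivation: it assumes the four collinear points are represented by scalar parameters along the line, and the function names `cross_ratio` and `moebius` as well as the sample parameters are hypothetical choices made for illustration. With $x$ at parameter 0 and $y$ at infinity, the same formula reduces to $k/k'$.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations


def cross_ratio(t1, t2, t3, t4):
    """Cross ratio (t1, t2; t3, t4) of four distinct points on a line,
    each given by its scalar parameter along the line."""
    return Fraction((t3 - t1) * (t4 - t2), (t3 - t2) * (t4 - t1))


def moebius(t, a, b, c, d):
    """A change of representation on the line: t -> (a t + b) / (c t + d)."""
    return Fraction(a * t + b, c * t + d)


params = (0, 1, 3, 7)  # illustrative parameters, not taken from the paper
alpha = cross_ratio(*params)

# All 24 orderings of the four points collapse onto six values, four each.
counts = Counter(cross_ratio(*perm) for perm in permutations(params))
six_values = {alpha, 1 / alpha, 1 - alpha,
              (alpha - 1) / alpha, alpha / (alpha - 1), 1 / (1 - alpha)}

# The cross ratio is unchanged by a non-singular change of representation.
mapped = [moebius(t, 2, 1, 1, 3) for t in params]
alpha_mapped = cross_ratio(*mapped)
```

Exact rational arithmetic is used here only so that the six values can be compared for equality without rounding concerns.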

2.4 Projections 

Let $\mathcal{P}^{n-1} \subset \mathcal{P}^n$ be some hyperplane, and $O \in \mathcal{P}^n$ a point
not lying on $\mathcal{P}^{n-1}$. If we like, we can choose the
representation such that $\mathcal{P}^{n-1}$ is given by $x_n = 0$ and
the point $O = (0, 0, \ldots, 0, 1)$. We can define a map

$$\sigma : \mathcal{P}^n - \{O\} \to \mathcal{P}^{n-1}$$

by

$$\sigma : P \mapsto \overline{OP} \cap \mathcal{P}^{n-1};$$

that is, sending a point $P \in \mathcal{P}^n$ other than $O$ to the point
of intersection of the line $OP$ with the hyperplane $\mathcal{P}^{n-1}$.
$\sigma$ is the projection from the point $O$ to the hyperplane
$\mathcal{P}^{n-1}$, and the point $O$ is called the center of projection
(COP). In terms of coordinates $x$, this amounts to

$$\sigma : (x_0, \ldots, x_n) \mapsto (x_0, \ldots, x_{n-1}).$$

As an example, the projection of 3D objects onto an
image plane is modeled by $x \mapsto Tx$, where $T$ is a $3 \times 4$
matrix, often called the camera transformation. The
set $S$ of all views of an object (ignoring problems of
self occlusion, i.e., assuming that all points are visible
from all viewpoints) is obtained by the group $GL_4$ of
$4 \times 4$ non-singular matrices applied to some arbitrary
representation of $\mathcal{P}^3$, and then dropping the coordinate
$x_3$.
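
To make the camera transformation concrete, the sketch below applies a $3 \times 4$ matrix to a homogeneous point and normalizes the result to the form $(x, y, 1)$. It is an illustration under stated assumptions rather than the paper's procedure: the helper names `project` and `camera_from_gl4` and the matrix entries are hypothetical, and the image point is assumed to be finite.

```python
from fractions import Fraction


def project(T, X):
    """Apply a 3x4 camera transformation T to a homogeneous point X of
    projective 3-space and return the image point (x, y, 1).

    Assumes the image point is finite (third coordinate non-zero)."""
    u = [sum(Fraction(T[i][j]) * X[j] for j in range(4)) for i in range(3)]
    return (u[0] / u[2], u[1] / u[2], Fraction(1))


def camera_from_gl4(M):
    """Drop the last coordinate of a 4x4 transform: keep its first three rows."""
    return [row[:] for row in M[:3]]


# Illustrative non-singular 4x4 matrix (hypothetical values).
M = [[1, 0, 0, 2],
     [0, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 0, 1]]
T = camera_from_gl4(M)

X = [2, 4, 6, 1]
p = project(T, X)                           # image of X
p_scaled = project(T, [3 * c for c in X])   # same point, different scale
```

Both calls return the same image point, which reflects the fact that only ratios of homogeneous coordinates are significant.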

2.5 The Affine Subgroup 

Let $A_i \subset \mathcal{P}^n$ be the subset of points $(x_0, \ldots, x_n)$ with
$x_i \neq 0$. Then the ratios $X_j = x_j/x_i$ are well defined and
are called affine or Euclidean coordinates on the projective
space, and $A_i$ is bijective to the affine space $A^n$,
i.e., $A_i \cong A^n$. The affine subgroup of $GL_{n+1}$ leaves
the hyperplane $x_i = 0$ invariant under all affine representations.
Any subgroup of $GL_{n+1}$ that leaves some
hyperplane invariant is an affine subgroup, and the invariant
hyperplane is called the ideal hyperplane. As an
example, a subgroup of $GL_4$ that leaves some plane invariant
is affine. It could be any plane, but if it is the
plane at infinity ($x_2 = 0$) then the mapping $\mathcal{P}^3 \mapsto \mathcal{P}^2$
is created by parallel projection, i.e., the COP is at infinity.
Since two lines are parallel if they meet on the
ideal hyperplane, when the ideal hyperplane is at
infinity, affine geometry takes its "intuitive" form of preserving
parallelism of lines and planes and preserving
ratios. The importance of the affine subgroups is that
there exist affine invariants that are not projective invariants.
Parallelism, the concept of a midpoint, area of
triangles, and classification of conics are examples of affine
properties that are not projective.



2.6 Epipoles 

Given two cameras with positions of their COP at
$O, O' \in \mathcal{P}^3$, respectively, the epipoles are at the intersection
of the line $OO'$ with both image planes. Recovering
the epipoles from point correspondences across two views 
is remarkably simple but is notoriously sensitive to noise 
in image measurements. For more details on recovering 
epipoles see [4, 29, 28, 5], and for comparative and error 
analysis see [17, 22]. In Part I of this paper we assume 
the epipoles are given; in Part II, where we make further 
use of derivations made in Section 3, we show that for 
purposes discussed there one can eliminate the epipoles 
altogether. 

2.7 Image Coordinates 

Image space is $\mathcal{P}^2$. Since the image plane is finite, we can
assign, without loss of generality, the value 1 as the third 
homogeneous coordinate to every image point. That is, 
if (x, y) are the observed image coordinates of some point 
(with respect to some arbitrary origin — say the geomet- 
ric center of the image), then p = (x,y, 1) denotes the 
homogeneous coordinates of the image plane. Note that 
by this notation we are not assuming that an observed 
point in one image is always mapped onto an observed 
(i.e., not at infinity) point in another view (that would 
constitute an affine plane); all that we are relying
upon is that points at infinity are not observed anyway, 
so we are allowed to assign the value 1 to all observed 
points. 

2.8 General Notations 

Vectors are always column vectors, unless mentioned
otherwise. The transpose notation will be added only
when otherwise there is a chance for confusion. Vectors
will be in bold-face only in conjunction with a scalar, i.e.,
$\lambda\mathbf{x}$ stands for the scalar $\lambda$ scaling the vector $\mathbf{x}$. Scalar
product will be denoted by a center dot, i.e., $x \cdot y$, again
avoiding the transpose notation except when necessary.
Cross product will be denoted as usual, i.e., $x \times y$. The
cross product, viewed as an operator, can be used between
a vector $x$ and a $3 \times 3$ matrix $A$ as follows:

$$x \times A = \begin{bmatrix} x_2 a_3 - x_3 a_2 \\ x_3 a_1 - x_1 a_3 \\ x_1 a_2 - x_2 a_1 \end{bmatrix},$$

where $a_1, a_2, a_3$ are the row vectors of $A$, and
$x = (x_1, x_2, x_3)$.
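
The operator form of the cross product can be illustrated with a small sketch. The helper names `cross` and `cross_matrix` and the numeric values are assumptions made for this example; the code builds $x \times A$ row by row from the rows of $A$ and then confirms that each column of the result is the ordinary cross product of $x$ with the corresponding column of $A$.

```python
def cross(x, y):
    """Ordinary cross product of two 3-vectors."""
    return [x[1] * y[2] - x[2] * y[1],
            x[2] * y[0] - x[0] * y[2],
            x[0] * y[1] - x[1] * y[0]]


def cross_matrix(x, A):
    """x × A, built row by row from the rows a1, a2, a3 of A."""
    a1, a2, a3 = A
    x1, x2, x3 = x
    return [[x2 * a3[j] - x3 * a2[j] for j in range(3)],
            [x3 * a1[j] - x1 * a3[j] for j in range(3)],
            [x1 * a2[j] - x2 * a1[j] for j in range(3)]]


x = [1, 2, 3]          # illustrative values, not taken from the paper
A = [[2, 0, 1],
     [1, 3, 0],
     [0, 1, 1]]
xA = cross_matrix(x, A)

# Column j of x × A is the cross product of x with column j of A.
columns_of_A = [[A[i][j] for i in range(3)] for j in range(3)]
columns_of_xA = [[xA[i][j] for i in range(3)] for j in range(3)]
```

A consequence of the column-wise reading is that $(x \times A)\,p = x \times (Ap)$ for any vector $p$.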

Part I 

3 Affine Structure and Invariant From 
Two Perspective Views 

The key idea underlying the derivations in this section is
to place the two camera centers as part of the reference
frame (simplex and unit point) of $\mathcal{P}^3$. Let $P_1, P_2, P_3$ be
three object points projecting onto corresponding points
$p_j, p'_j$, $j = 1, 2, 3$, in the two views. We assign the coordinates
$(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0)$ to $P_1, P_2, P_3$, respectively.
For later reference, the plane passing through
$P_1, P_2, P_3$ will be denoted by $\pi_1$. Let $O$ be the COP of
the first camera, and $O'$ the COP of the second camera.
We assign the coordinates $(0, 0, 0, 1), (1, 1, 1, 1)$ to $O, O'$,
respectively (see Figure 1). This choice of representation
is always possible because the two cameras are part of
$\mathcal{P}^3$. By construction, the point of intersection of the line
$OO'$ with $\pi_1$ has the coordinates $(1, 1, 1, 0)$ (note that $\pi_1$
is the plane $x_3 = 0$, therefore the linear combination of
$O$ and $O'$ with $x_3 = 0$ must be a multiple of $(1, 1, 1, 0)$).

Figure 1: (figure not reproduced)

Let $P$ be some object point projecting onto $p, p'$. The
line $OP$ intersects $\pi_1$ at the point $(\alpha, \beta, \gamma, 0)$. The
coordinates $\alpha, \beta, \gamma$ can be recovered by projecting the image
plane onto $\pi_1$, as follows. Let $v, v'$ be the location of both
epipoles in the first and second view, respectively (see
Section 2.6). Given the epipoles $v$ and $v'$, we have by our
choice of coordinates that $p_1, p_2, p_3$ and $v$ are projectively
(in $\mathcal{P}^2$) mapped onto $e_1 = (1, 0, 0)$, $e_2 = (0, 1, 0)$, $e_3 =
(0, 0, 1)$ and $e_4 = (1, 1, 1)$, respectively. Therefore, there
exists a unique element $A_1 \in PGL_3$ ($3 \times 3$ matrix defined
up to a scale) that satisfies $A_1 p_j \cong e_j$, $j = 1, 2, 3$, and
$A_1 v = e_4$. Note that we have made a choice of scale by
setting $A_1 v$ to $e_4$; this is simply for convenience, as will
be clear later on. It follows that $A_1 p = (\alpha, \beta, \gamma)$.

Similarly, the line $O'P$ intersects $\pi_1$ at $(\alpha', \beta', \gamma', 0)$.
Let $A_2 \in PGL_3$ be defined by $A_2 p'_j \cong e_j$, $j = 1, 2, 3$, and
$A_2 v' = e_4$. It follows that $A_2 p' = (\alpha', \beta', \gamma')$. Since $P$
can be described as a linear combination of two points
along each of the lines $OP$ and $O'P$, we have the following
equation:

$$P \cong \begin{pmatrix} \alpha \\ \beta \\ \gamma \\ 0 \end{pmatrix} + k \begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \end{pmatrix} = \begin{pmatrix} \alpha' \\ \beta' \\ \gamma' \\ 0 \end{pmatrix} + s \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix},$$

from which it immediately follows that $k = s$. We have
therefore, by the choice of putting both cameras on the
frame of reference, that the transformation in $\mathcal{P}^3$ is affine
(the plane $\pi_1$ is preserved). If we leave the first camera
fixed and move the second camera to a new position
(which must be a general position, i.e., $O' \notin \pi_1$), then the
transformation in $\mathcal{P}^3$ belongs to the same affine group.



Note that since only ratios of coordinates are significant
in $\mathcal{P}^n$, $k$ is determined up to a uniform scale, and any
point $P_o \notin \pi_1$ can be used to set a mutual scale for
all views, for example by setting an appropriate scale for $v'$.
The value of $k$ can easily be determined as
follows: we have

$$\begin{pmatrix} \alpha' \\ \beta' \\ \gamma' \end{pmatrix} = \begin{pmatrix} \alpha \\ \beta \\ \gamma \end{pmatrix} - k \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}.$$

Multiply both sides by $A_2^{-1}$, for which we get

$$p' \cong Ap - kv', \qquad (1)$$

where $A = A_2^{-1} A_1$. Note that $A \in PGL_3$ is a
collineation between the two image planes, due to $\pi_1$,
determined by $p'_j \cong Ap_j$, $j = 1, 2, 3$, and $Av \cong v'$ (therefore,
it can be recovered directly without going through
$A_1, A_2$). Since $k$ is determined up to a uniform scale,
we need a fourth correspondence $p_o, p'_o$, and let $A$, or $v'$,
be scaled such that $p'_o \cong Ap_o - v'$. Then $k$ is an affine
invariant, which we will refer to as "affine depth". Furthermore,
$(x, y, 1, k)$ is the homogeneous coordinate
representation of $P$, and the $3 \times 4$ matrix $[A, -v']$ is a
camera transformation matrix between the two views.
Note that $k$ is invariant when computed against a reference
view (the first view in this derivation), that the camera
transformation matrix depends not only on the camera
displacement but also on the choice of three points, and
that the camera is an "affine engine" if a reference view is
available. More details on theoretical aspects of this result
are provided in Section 3.2, but first we discuss its
algorithmic aspect.

3.1 Two Algorithms: Re-projection and Affine 
Reconstruction from Two Perspective 
Views 

On the practical side, we have arrived at a remarkably
simple algorithm for affine reconstruction from two
perspective/orthographic views (with an uncalibrated camera),
and an algorithm for generating novel views of a
scene (re-projection). For reconstruction we follow these
steps:

1. Compute epipoles $v, v'$ (see Section 2.6).

2. Compute the matrix $A$ that satisfies $Ap_j \cong p'_j$, $j = 1, 2, 3$,
and $Av \cong v'$. This requires a solution of a
linear system of eight equations (see Appendices in
[19, 27, 28] for details).

3. Set the scale of $v'$ by using a fourth corresponding
pair $p_o, p'_o$ such that $p'_o \cong Ap_o - v'$.

4. For every corresponding pair $p, p'$ recover the affine
depth $k$ that satisfies $p' \cong Ap - kv'$. As a technical
note, $k$ can be recovered in a least-squares fashion
by using cross-products:

$$k = \frac{(p' \times v')^T (p' \times Ap)}{\|p' \times v'\|^2}.$$
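
Step 4 of the list above can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes that $A$ and $v'$ have already been recovered and mutually scaled (steps 1 to 3), and the function name `affine_depth` together with all numeric values is hypothetical. The synthetic pair $p, p'$ is constructed from a known depth so that the recovered value can be compared with it.

```python
from fractions import Fraction


def cross(x, y):
    """Ordinary cross product of two 3-vectors."""
    return [x[1] * y[2] - x[2] * y[1],
            x[2] * y[0] - x[0] * y[2],
            x[0] * y[1] - x[1] * y[0]]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def matvec(A, p):
    return [dot(row, p) for row in A]


def affine_depth(p, p_prime, A, v_prime):
    """Least-squares k for p' = A p - k v' (equality up to scale),
    using k = (p' x v') . (p' x A p) / |p' x v'|^2."""
    n = cross(p_prime, v_prime)
    return Fraction(dot(n, cross(p_prime, matvec(A, p)))) / dot(n, n)


# Hypothetical collineation and epipole, assumed already mutually scaled.
A = [[2, 0, 1], [1, 3, 0], [0, 1, 1]]
v_prime = [1, 2, 1]

# Synthesize a correspondence from a known depth, then recover the depth.
p = [3, 4, 1]
k_true = 2
u = [a - k_true * b for a, b in zip(matvec(A, p), v_prime)]
p_prime = [Fraction(c, u[2]) for c in u]   # observed as (x', y', 1)
k = affine_depth(p, p_prime, A, v_prime)
```

Because both cross products carry the same factor of $p'$, the recovered value does not depend on the scale chosen for $p'$.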

Note that $k$ is invariant as long as we use the first view
as a reference view, i.e., compute $k$ between a reference
view $p$ and any other view. The invariance of $k$ can be
used to "re-project" the object onto any third view $p''$
as follows. We observe:

$$p'' \cong Bp - kv'',$$

for some (unique up to a scale) matrix $B$ and epipole $v''$.
One can solve for $B$ and $v''$ by observing six corresponding
points between the first and third view. Each pair of
corresponding points $p_j, p''_j$ contributes two equations:

$$b_{31} x_j x''_j + b_{32} y_j x''_j - k_j v''_3 x''_j + x''_j = b_{11} x_j + b_{12} y_j + b_{13} - k_j v''_1,$$

$$b_{31} x_j y''_j + b_{32} y_j y''_j - k_j v''_3 y''_j + y''_j = b_{21} x_j + b_{22} y_j + b_{23} - k_j v''_2,$$

where $b_{33} = 1$ (this sets an arbitrary scale, because
the system of equations is homogeneous; of course
this excludes the case where $b_{33} = 0$, but in practice
this is not a problem; also, one can use principal component
analysis instead of setting the value of some chosen
element of $B$ or $v''$). The values of $k_j$ are found
from the correspondences $p_j, p'_j$, $j = 1, \ldots, 6$ (note that
$k_1 = k_2 = k_3 = 0$). Once $B, v''$ are recovered, we can
find the location of $p''_i$ for any seventh point $p_i$, by first
solving for $k_i$ from the equation $p'_i \cong Ap_i - k_i v'$, and then
substituting the result in the equation $p''_i \cong Bp_i - k_i v''$.
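
The final two-step substitution can be sketched as below. The sketch assumes that $A, v'$ and $B, v''$ are already known (it does not solve the six-point linear system), and the names `affine_depth`, `transfer` and `reproject` as well as the matrix entries are hypothetical choices for illustration.

```python
from fractions import Fraction


def cross(x, y):
    return [x[1] * y[2] - x[2] * y[1],
            x[2] * y[0] - x[0] * y[2],
            x[0] * y[1] - x[1] * y[0]]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def matvec(M, p):
    return [dot(row, p) for row in M]


def affine_depth(p, p1, A, v1):
    """k with p1 proportional to A p - k v1 (cross-product formula)."""
    n = cross(p1, v1)
    return Fraction(dot(n, cross(p1, matvec(A, p)))) / dot(n, n)


def transfer(p, k, M, e):
    """Point (x, y, 1) proportional to M p - k e."""
    u = [a - k * b for a, b in zip(matvec(M, p), e)]
    return [Fraction(c) / u[2] for c in u]


def reproject(p, p1, A, v1, B, v2):
    """Predict the third-view point from a correspondence in views 1 and 2."""
    k = affine_depth(p, p1, A, v1)    # depth relative to the reference view
    return transfer(p, k, B, v2)      # substitute into p'' = B p - k v''


# Hypothetical, already-recovered quantities for views 2 and 3.
A = [[2, 0, 1], [1, 3, 0], [0, 1, 1]]
v1 = [1, 2, 1]
B = [[1, 1, 0], [0, 2, 1], [1, 0, 2]]
v2 = [2, 1, 1]

p = [3, 4, 1]
p1 = transfer(p, 2, A, v1)            # synthetic second-view observation
p2 = reproject(p, p1, A, v1, B, v2)   # predicted third-view location
```

The prediction uses only image measurements in the first two views plus the recovered pairs $A, v'$ and $B, v''$; no explicit 3D reconstruction is stored.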

3.2 Results of Theoretical Nature 

Let $\psi \in S$ be some view from the set of all possible
views, and let $p_1, p_2, p_3 \in \psi$ be non-collinear points
projected from some plane $\pi$. Also, let $S_\pi \subset S$ be the
subset of views for which the points corresponding to $p_j$,
$j = 1, 2, 3$, are non-collinear ($A$ is full rank). Note that
$S_\pi$ contains all views for which the COP is not on $\pi$. We
have the following result:

There exists an affine invariant between a reference view
$\psi$ and the set of views $S_\pi$.

The result implies that, within the framework of un- 
calibrated cameras, there are certain tasks which are in- 
herently affine and, therefore, projective invariants are 
not necessary and instead affine invariants are sufficient 
(it is yet to be shown exactly when we need to recover
projective invariants; this is the subject of Section 4).
Consider for example the task of recognition within the 
context of alignment [30, 11]. In the alignment approach, 
two or more reference views (also called model views), 
or a 3D model, are stored in memory — and referred to 
as a "model" of the object. During the recognition pro- 
cess, a small number of corresponding points between 
the reference views and the novel view are used for "re- 
projecting" the object onto the novel viewing position 
(as for example using the method described in the previ- 
ous section). Recognition is achieved if the re-projected 
image is successfully matched against the input image. 
This entails a sequential search over all possible models 
until a match is found between the novel view and the 
re-projected view using a particular model. The implication
of the result above is that since alignment uses
a fixed set of reference views of an object to perform
recognition, only affine machinery is really necessary
to perform re-projection. As will be shown in Section 4,
projective machinery requires more points and
slightly more computations (but see Section 9 for
discussion about practical considerations).

Figure 2: (figure not reproduced)

The manner in which affine-depth was derived gives 
rise to a refinement on the general result that four corre- 
sponding points and the epipoles are required for affine 
reconstruction from two perspective views [4, 29]. Our 
derivation shows that in addition to the epipoles, we 
need only three points to recover affine structure up to 
a uniform scale, and therefore the fourth point is needed 
only for setting such a scale. To summarize, 

In the case where the locations of the epipoles are known,
three corresponding points are sufficient for computing
the affine structure, up to a uniform but unknown scale,
for all other points in space projecting onto corresponding
points in both views.

We have also, 

Affine shape can be described as the ratio of a point P 
from a plane and the COP, normalized by the ratio of a 
fixed point from the reference plane and the COP. 

Therefore, affine-depth $k$ depends only on three points
(setting up a reference plane), the COP (of the reference
view) and a fourth point for setting a scale. This way
of describing structure relative to a reference plane is
very similar to what [14] suggested for reconstruction
from two orthographic views. The difference is that there
the fourth point played both the role of the COP and that
of setting a scale. We will show next that the affine-depth
structure description derived here reduces exactly
to what [14] described in the orthographic case.

There are two ways to look at the orthographic case.
First, when both views are orthographic, the collineation
$A$ (in Equation 1) between the two images is an affine
transformation in $\mathcal{P}^2$, i.e., the third row of $A$ is $(0, 0, 1)$.
Therefore, $A$ can be computed from only three corresponding
points, $Ap_j = p'_j$, $j = 1, 2, 3$. Because both $O$
and $O'$ are at infinity, the epipole $v'$ is on the plane
$x_2 = 0$, i.e., $v'_3 = 0$, and as a result all epipolar lines
are parallel to each other. A fourth corresponding point
$p_o, p'_o$ can be used both to determine the direction of
epipolar lines and to set the scale for the affine depth of
all other points, as described in [14]. We see, therefore,
that the orthographic case is simply a particular case of
Equation 1. Alternatively, consider again the structure
description entailed by our derivation of affine depth. If
we denote the point of intersection of the line $OP$ with
$\pi_1$ by $\tilde{P}$, we have (see Figure 2)

$$k = \frac{\;\dfrac{P - \tilde{P}}{P - O}\;}{\;\dfrac{P_o - \tilde{P}_o}{P_o - O}\;}.$$

Let $O$ (the COP of the first camera) go to infinity, in
which case affine-depth approaches

$$k \longrightarrow \frac{P - \tilde{P}}{P_o - \tilde{P}_o},$$

which is precisely the way shape was described in [14]
(see also [26, 27]). In the second view, if it is orthographic,
then the two trapezoids $P, \tilde{P}, p', Ap$ and
$P_o, \tilde{P}_o, p'_o, Ap_o$ are similar, and from similarity of
trapezoids we obtain

$$\frac{P - \tilde{P}}{P_o - \tilde{P}_o} = \frac{p' - Ap}{p'_o - Ap_o},$$

which, again, is the expression described in [14, 26]. Note 
that affine-depth in the orthographic case does not de- 
pend any more on O, and therefore remains fixed regard- 
less of what pair of views we choose, namely, a reference 
view is not necessary any more. This leads to the fol- 
lowing result: 

Let $S_o \subset S$ be the subset of views created by means of
parallel projection, i.e., the plane $x_2 = 0$ is preserved.
Given four fixed reference points, affine-depth on $S$ is
reference-view-dependent, whereas affine-depth on $S_o$ is
reference-view-independent.

Consider next the resulting camera transformation
matrix $[A, -v']$. The matrix $A$ depends on the choice of
three points and therefore does not only depend on the
camera displacement. This additional degree of freedom
is a direct result of our camera being uncalibrated, i.e.,
we are free to choose the internal camera parameters (focal
length, principal point, and image coordinates scale
factors) as we like. The matrix $A$ is unique, i.e., depends
only on camera displacement, if we know in advance that
the internal camera parameters remain fixed for all views
$S_\pi$. For example, assume the camera is calibrated in the
usual manner, i.e., focal length is 1, principal point is at
$(0, 0, 1)$ in Euclidean coordinates, and image scale factors
are 1 (image plane is parallel to the $xy$ plane of the Euclidean
coordinate system). In that case $A$ is an orthogonal matrix
and can be recovered from two corresponding points
and the epipoles, by imposing the constraint that vector
magnitudes remain unchanged (each point provides
three equations). A third corresponding point can be
used to determine the reflection component (i.e., making
sure the determinant of $A$ is 1 rather than $-1$). More
details can be found in [27, 15]. Since in the uncalibrated
case $A$ is not unique, let $A_\pi$ denote the fact that $A$ is
the collineation induced by a plane $\pi$, and let $k_\pi$ denote
the fact that the affine-depth also depends on the choice
of $\pi$. We see, therefore, that there exists a family of
solutions for the camera transformation matrix and the
affine-depth as a function of $\pi$. This immediately implies
that a naive solution for $A, k$, given $v'$, from point
correspondences leads to a singular system of equations (even
if many points are used for a least-squares solution).

Given the epipole $v'$, the linear system of equations for
solving for $A$ and $k_j$ of the equation

$$p'_j \cong Ap_j - k_j v',$$

from point correspondences $p_j, p'_j$ is singular, unless further
constraints are introduced.



We see that equation counting alone is not sufficient 
for obtaining a unique solution, and therefore the knowl- 
edge that A is a homography of a plane is critical for this 
task. For example, one can solve for $A$ and $k_j$ from many
correspondences in a least-squares approach by first setting
$k_j = 0$, $j = 1, 2, 3$ and $k_4 = 1$; otherwise the solution
may not be unique.

Finally, consider the "price" we are paying for an un- 
calibrated, affine framework. We can view this in two 
ways, somewhat orthogonal. First, if the scene is un- 
dergoing transformations, and the camera is fixed, then 
those transformations are affine in 3D, rather than rigid. 
For purposes of achieving visual recognition the price we 
are paying is that we might confuse two different objects
that are affinely related. Second, because of the
non-uniqueness of the camera transformation matrix it
appears that the set of views $S_\pi$ is a superset of the set
of views that could be created by a calibrated camera
taking pictures of the object. The natural question is
whether this superset can, nevertheless, be realized by
a calibrated camera. In other words, if we have a calibrated
camera (or we know that the internal camera
parameters remain fixed for all views), then can we generate
$S_\pi$, and if so how? This question was addressed
first in [12] but assuming only orthographic views. A
more general result is expressed in the following proposition:

Proposition 1 Given an arbitrary view $\psi_o \in S_\pi$ generated
by a camera with COP at initial position $O$, then all
other views $\psi \in S_\pi$ can be generated by a rigid motion
of the camera frame from its initial position, if in addition
to taking pictures of the object we allow any finite
sequence of pictures of pictures to be taken as well.

The proof has a trivial and a less trivial component. 
The trivial part is to show that an affine motion of the 
camera frame can be decomposed into a rigid motion 
followed by some arbitrary collineation in $\mathcal{P}^2$. The less
trivial component is to show that any collineation in $\mathcal{P}^2$
can be created by a finite sequence of views of a view
where only rigid motion of the camera frame is allowed. 
The details can be found in Appendix A. 

The next section treats the projective case. It will 
be shown that this involves looking for invariants that 
remain fixed when any two views of S are chosen. The 
section may be skipped if the reader wishes to get to 
Part II of the paper — only results of affine-depth are 
used there. 

4 Projective Structure and Invariant 
From Two Perspective Views 

Affine depth required the construction of a single ref- 
erence plane, and for that reason it was necessary to 
require that one view remained fixed to serve as a ref- 
erence view. To permit an invariant from any pair of 
views of S, we should, by inference, design the construc- 
tion such that the invariant be defined relative to two 
planes. By analogy, we will call the invariant "projec- 
tive depth" [29]. This is done as follows. 

We assign the coordinates $(1, 0, 0, 0)$, $(0, 1, 0, 0)$ and
$(0, 0, 1, 0)$ to $P_1, P_2, P_3$, respectively. The coordinates
$(0, 0, 0, 1)$ are assigned to a fourth point $P_4$, and the coordinates
$(1, 1, 1, 1)$ to the COP of the first camera $O$
(see Figure 3). The plane passing through $P_1, P_2, P_3$ is
denoted by $\pi_1$ (as before), and the plane passing through
$P_1, P_3, P_4$ is denoted by $\pi_2$. Note that the line $OP_4$ intersects
$\pi_1$ at $(1, 1, 1, 0)$, and the line $OP_2$ intersects $\pi_2$
at $(1, 0, 1, 1)$.

As before, let $A_1$ be the collineation from the image
plane to $\pi_1$ by satisfying $A_1 p_j \cong e_j$, $j = 1, \ldots, 4$,
where $e_1 = (1, 0, 0)$, $e_2 = (0, 1, 0)$, $e_3 = (0, 0, 1)$ and
$e_4 = (1, 1, 1)$. Similarly, let $E_1$ be the collineation from
the image plane to $\pi_2$ by satisfying $E_1 p_1 \cong e_1$, $E_1 p_2 \cong e_4$,
$E_1 p_3 \cong e_2$ and $E_1 p_4 \cong e_3$. Note that if $A_1 p = (\alpha, \beta, \gamma)$,
then $E_1 p \cong (\beta - \alpha, \beta - \gamma, \beta)$. We have therefore
that the intersection of the line $OP$ with $\pi_1$ is the
point $P_{\pi_1} = (\alpha, \beta, \gamma, 0)$, and the intersection with $\pi_2$ is
the point $P_{\pi_2} = (\beta - \alpha, 0, \beta - \gamma, \beta)$. We can express $P$
and $O$ as a linear combination of those points:

$$P \cong P_{\pi_1} + k P_{\pi_2},$$

$$O \cong P_{\pi_1} + k' P_{\pi_2}.$$


Consider the cross ratio $k/k'$ of the four points
$O, P_{\pi_1}, P_{\pi_2}, P$. Note that $k' = 1$ independently of $P$,
therefore the cross ratio is simply $k$. As in the affine
case, $k$ is invariant up to a uniform scale, and any fifth
object point $P_o$ (not lying on any face of the tetrahedron
$P_1, P_2, P_3, P_4$) can be assigned $k_o = 1$ by choosing
the appropriate scale for $A_1$ (or $E_1$). This has
the effect of mapping the fifth point $P_o$ onto the COP
($P_o \cong (1, 1, 1, 1)$). We have, therefore, that $k$ (normalized)
is a projective invariant, which we call "projective
depth". Relative shape is described as the ratio of a
point from two planes, defined by four object points,
along the line to a fifth point, which is also the center
of projection, that is set up such that its ratio from the
two planes is of unit value. Any transformation $T \in GL_4$
will leave the ratio $k$ invariant. What remains is to show
how $k$ can be computed given a second view.

Let $A$ be the collineation between the two image planes
due to $\pi_1$, i.e., $A p_j \cong p'_j$, $j = 1, 2, 3$, and
$Av \cong v'$, where $v, v'$ are the epipoles. Similarly, let $E$ be
the collineation due to $\pi_2$, i.e., $E p_j \cong p'_j$,
$j = 1, 3, 4$, and $Ev \cong v'$. Note that three corresponding
points and the corresponding epipoles are sufficient for computing
the collineation due to the plane projecting onto the three points
in both views; this is clear from the derivation in Section 3,
but can also be found in [28, 29, 23]. We have that the
projections of $P_{\pi_1}$ and $P_{\pi_2}$ onto the second image are
captured by $Ap$ and $Ep$, respectively. Therefore, the
cross ratio of $O, P_{\pi_1}, P_{\pi_2}, P$ is equal to the cross ratio of
$v', Ap, Ep, p'$, which is computed as follows:

$$p' \cong Ap - sEp,$$

$$v' \cong Ap - s'Ep,$$

then $k = s/s'$, up to a uniform scale factor (which is set
using a fifth point). Here we can also show that $s'$ is a
constant independent of $p$. There is more than one way
to show that; a simple way is as follows. Let $q$ be an
arbitrary point in the first image. Then,

$$v' \cong Aq - s'_q Eq.$$

Let $H$ be a matrix defined by $H = A - s'_q E$. Then
$v' \cong Hv$ and $v' \cong Hq$. This could happen only if
$v' \cong Hp$ for all $p$, and $s' = s'_q$. We have arrived at a very
simple algorithm for recovering a projective invariant from two
perspective (orthographic) views:

$$p' \cong Ap - kEp, \tag{2}$$



where $A$ and $E$ are described above, and $k$ is invariant
up to a uniform scale, which can be set by observing a
fifth correspondence $p_o, p'_o$, i.e., set the scale of $E$ to
satisfy $p'_o \cong A p_o - E p_o$. Unlike the affine case, $k$ is
invariant for any two views from the set $S$ of all possible views.
Note that $k$ need not be normalized using a fifth point
if the first view remains fixed (we are back to the affine
case). We have arrived at the following result, which is
a refinement of the general result made in [4] that five
corresponding points and the corresponding epipoles are
sufficient for reconstruction up to a collineation in $\mathcal{P}^3$:

In the case where the locations of the epipoles are known,
four corresponding points, coming from four non-coplanar
points in space, are sufficient for computing the
projective structure, up to a uniform but unknown scale,
for all other points in space projecting onto corresponding
points in both views. A fifth corresponding point,
coming from a point in general position with the other
four points, can be used to set the scale.

We have also, 

Projective shape can be described as the ratio of a point P 
from two faces of the tetrahedron, normalized by the ra- 
tio of a fixed point (the unit point of the reference frame) 
from those faces. 
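
As an illustration of Equation 2, the following is a minimal sketch (not the paper's implementation) of how $k$ could be read off once $A$ and $E$ are available. The function names and the numeric matrices are hypothetical, chosen only to give a synthetic check; recovering $A$ and $E$ from points and epipoles is assumed to have been done elsewhere.

```python
import numpy as np

def scale_along(p_prime, a, b):
    """Least-squares k such that p_prime is parallel to (a - k*b).

    p' x (a - k b) = 0 gives p' x a = k (p' x b), so k is the ratio of
    two parallel 3-vectors, taken here by projecting one onto the other.
    """
    ca = np.cross(p_prime, a)
    cb = np.cross(p_prime, b)
    return float(ca @ cb) / float(cb @ cb)

def projective_depth(p, p_prime, A, E):
    """k in p' ~ A p - k E p (Equation 2), for homogeneous p, p'."""
    return scale_along(p_prime, A @ p, E @ p)

def normalize_E(A, E, p_o, p_o_prime):
    """Rescale E so that the fifth point p_o gets unit projective depth."""
    return projective_depth(p_o, p_o_prime, A, E) * E

# Synthetic check with made-up collineations (illustrative values only).
A = np.array([[1.0, 0.1, 0.2], [-0.1, 1.0, 0.1], [0.2, -0.1, 1.0]])
E = np.array([[0.9, -0.2, 0.1], [0.2, 0.9, -0.1], [-0.1, 0.2, 1.0]])

p = np.array([0.3, -0.2, 1.0])
k_true = 0.7
p_prime = 2.5 * (A @ p - k_true * (E @ p))   # arbitrary projective scale
k_est = projective_depth(p, p_prime, A, E)

# A fifth point fixes the free scale: after normalization its depth is 1.
p_o = np.array([-0.4, 0.5, 1.0])
k_o = 1.6
p_o_prime = A @ p_o - k_o * (E @ p_o)
E_n = normalize_E(A, E, p_o, p_o_prime)
k_o_normalized = projective_depth(p_o, p_o_prime, A, E_n)
k_normalized = projective_depth(p, p_prime, A, E_n)
```

In this sketch the normalized depth of any other point is simply its raw depth divided by that of the fifth point, which is the "uniform scale" referred to in the text.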

The practical implication of this derivation is that a
projective invariant, such as the one described here, is
worth computing only for tasks for which we do not have
a fixed reference view available. The qualification matters
because projective depth requires an additional corresponding
point and slightly more computation (recovering
the matrix $E$ in addition to $A$). Such a task, for
example, is to update the reconstructed structure from a
moving stereo rig. At each time instant we are given a
pair of views from which projective depth can be computed
(projective coordinates follow trivially), and since
both cameras change their position from one time
instant to the next, we cannot rely on an affine invariant.

5 Summary of Part I 

Given a view $\psi_o$ with image points $p$, there exists an
affine invariant $k$ between $\psi_o$ and any other view $\psi_i$, with
corresponding image points $p'$, satisfying the following
equation:

$$\mu p' = Ap - kv',$$

where $A$ is the collineation between the two image planes
due to the projection of some plane $\pi_1$ projecting to both
views, and $v'$ is the epipole scaled such that
$\mu_o p'_o = A p_o - v'$ for some point $p_o$. The set of all views
$S_{\pi_1}$ for which the camera's center is not on $\pi_1$ will satisfy
the equation above against $\psi_o$. The view $\psi_o$ is a reference
view.

A projective invariant $k$ is defined between any two
views $\psi_i$ and $\psi_j$, projecting onto corresponding points $p$
and $p'$, respectively (we reuse the symbols $k$, $p$ and $p'$ to
avoid introducing new notation). The invariant satisfies the
following equation:

$$\mu p' = Ap - kEp,$$

where $A$ is the collineation due to some plane $\pi_1$, and
$E$ is the collineation due to some other plane $\pi_2$ scaled
such that $\mu_o p'_o = A p_o - E p_o$, for some point $p_o$.



Part II 

6 Algebraic Functions of Views 

In this part of the paper we use the results established in
Section 3 to derive results of a different nature: instead
of reconstruction of shape and invariants, we would like to
establish a direct connection between views expressed as
functions of image coordinates alone, which we will
call "algebraic functions of views". With these functions
one can manipulate views of an object, such as create
new views, without the need to recover shape or camera
geometry as an intermediate step; all that is needed
is to appropriately combine the image coordinates of two
reference views.

Algebraic functions of two views include the expression

$$p'^\top F p = 0, \tag{3}$$

where $F$ is known as the "Fundamental" matrix (cf. [4])
(a projective version of the well known "Essential" matrix
of [16]), and the expression

$$\alpha_1 x' + \alpha_2 y' + \alpha_3 x + \alpha_4 y + \alpha_5 = 0, \tag{4}$$

due to [10], which is derived for orthographic views.
These functions express the epipolar geometry between 
the two views in the perspective and orthographic cases, 
respectively. Algebraic functions of three views were in- 
troduced in the past only for orthographic views [31, 21]. 
For example, 

$$\alpha_1 x'' + \alpha_2 x' + \alpha_3 x + \alpha_4 y + \alpha_5 = 0.$$

These functions express the image coordinates of one view
as a function of the image coordinates of two other views.
In the example above, the x coordinate in the third view,
$x''$, is expressed as a linear function of image coordinates
in two other views; a similar expression exists for $y''$.

We will use the affine-depth invariant result to de- 
rive algebraic functions of three perspective views. The 
relationship between a perspective view and two other 
perspective views is shown to be trilinear in image coor- 
dinates across the three views. The relationship is shown 
to be bilinear if two of the views are orthographic — a 
special case useful for recognition tasks. We will start by
addressing the two-view case. We will use Equation 1 to
relate the entries of the camera transformation $A$ and $v'$
(of Equation 1) to the fundamental matrix by showing
that $F = v' \times A$. This also has the advantage of introducing
an alternative way of deriving expressions 3 and 4, one
that puts them both under a single framework.

6.1 Algebraic Functions of Two Views 

Consider Equation 1, reproduced below,

$$\mu p' = Ap - kv'.$$

By simple manipulation of this equation we obtain:

$$k = \frac{x' a_3 \cdot p - a_1 \cdot p}{x' v'_3 - v'_1} = \frac{y' a_3 \cdot p - a_2 \cdot p}{y' v'_3 - v'_2} = \frac{x' a_2 \cdot p - y' a_1 \cdot p}{x' v'_2 - y' v'_1}, \tag{5}$$

where $a_1, a_2, a_3$ are the row vectors of $A$ and
$v' = (v'_1, v'_2, v'_3)$. After equating the first two terms, we
obtain:

$$x'(v'_2 a_3 \cdot p - v'_3 a_2 \cdot p) + y'(v'_3 a_1 \cdot p - v'_1 a_3 \cdot p) + (v'_1 a_2 \cdot p - v'_2 a_1 \cdot p) = 0. \tag{6}$$

Note that the terms within parentheses are linear poly- 
nomials in x, y with fixed coefficients (i.e., depend only 
on A and v'). Also note that we get the same expres- 
sion when equating the first and third, or the second and 
third terms of Equation 5. This leads to the following 
result: 

The image coordinates $(x, y)$ and $(x', y')$ of two corresponding
points across two perspective views satisfy a
unique equation of the following form:

$$x'(\alpha_1 x + \alpha_2 y + \alpha_3) + y'(\alpha_4 x + \alpha_5 y + \alpha_6) + \alpha_7 x + \alpha_8 y + \alpha_9 = 0, \tag{7}$$

where the coefficients $\alpha_j$, $j = 1, ..., 9$, have a fixed
relation to the camera transformation $A$ and $v'$ of Equation 1.
Reading them off Equation 6:

$$(\alpha_1, \alpha_2, \alpha_3) = v'_2 a_3 - v'_3 a_2, \quad (\alpha_4, \alpha_5, \alpha_6) = v'_3 a_1 - v'_1 a_3, \quad (\alpha_7, \alpha_8, \alpha_9) = v'_1 a_2 - v'_2 a_1.$$

Equation 7 can also be written as $p'^\top F p = 0$, where
the entries of the matrix $F$ are the coefficients $\alpha_j$, and
therefore $F = v' \times A$. We have, thus, obtained a new and
simple relationship between the elements of the "fundamental"
matrix $F$ and the elements of the camera transformation
$A$ and $v'$. It is worth noting that this result
can be derived much more easily, as follows. First, the
relationship $p'^\top F p = 0$ can be derived, as observed by [4],
from the fact that $F$ is a correlation mapping points
$p$ onto their corresponding epipolar lines $l'$ in the second
image, and therefore $p' \cdot l' = 0$. Second$^1$, since
$l' = v' \times Ap$, we have $F = v' \times A$. It is known that
the rank of the fundamental matrix is 2; we can use this
relationship to show that as well:



$$F = v' \times A = \begin{bmatrix} v'_2 a_3 - v'_3 a_2 \\ v'_3 a_1 - v'_1 a_3 \\ v'_1 a_2 - v'_2 a_1 \end{bmatrix},$$



where $a_1, a_2, a_3$ are the row vectors of $A$. Let $f_1, f_2, f_3$
be the row vectors of $F$; then it is easy to verify that

$$f_3 = \alpha f_1 + \beta f_2,$$

where (for $v'_3 \neq 0$) $\alpha = -v'_1/v'_3$ and $\beta = -v'_2/v'_3$.

Next, we can use the result $F = v' \times A$ to show how
the orthographic case, treated by [10], fits this relationship.
In the framework of Equation 1, we saw that with
orthographic views we have $A$ being affine in $\mathcal{P}^2$, i.e.,
$a_3 \cdot p = 1$, and $v'_3 = 0$. After substitution in Equation 6,
we obtain the equation:

$$\alpha_1 x' + \alpha_2 y' + \alpha_3 x + \alpha_4 y + \alpha_5 = 0, \tag{8}$$

where the coefficients $\alpha_j$, $j = 1, ..., 5$, have the following
values (with $a_{ij}$ denoting the $j$-th entry of the row $a_i$):

$$\alpha_1 = v'_2, \quad \alpha_2 = -v'_1, \quad \alpha_3 = v'_1 a_{21} - v'_2 a_{11}, \quad \alpha_4 = v'_1 a_{22} - v'_2 a_{12}, \quad \alpha_5 = v'_1 a_{23} - v'_2 a_{13}.$$

These coefficients are also the entries of the fundamental
matrix, which can also be derived from $F = v' \times A$ by
setting $v'_3 = 0$ and $a_3 = (0, 0, 1)$.
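
The relationship $F = v' \times A$, the rank-2 property, and the orthographic special case can be checked numerically. The sketch below is only an illustration: the matrix $A$, the epipole and the test point are hypothetical values, and `skew` is a helper introduced here to write the cross product as a matrix.

```python
import numpy as np

def skew(v):
    """Matrix [v]_x such that skew(v) @ w equals np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

# Hypothetical camera transformation and epipole (illustrative values).
A = np.array([[1.0, 0.1, 0.2], [-0.1, 1.0, 0.1], [0.2, -0.1, 1.0]])
v = np.array([0.8, 0.2, 0.1])

F = skew(v) @ A                      # rows are v2*a3 - v3*a2, etc.

# A corresponding pair generated by Equation 1 satisfies p'^T F p = 0.
p = np.array([0.3, -0.2, 1.0])
k = 0.7
p_prime = A @ p - k * v
residual = float(p_prime @ F @ p)

# Rank 2: the third row is a combination of the first two.
rank_F = np.linalg.matrix_rank(F, tol=1e-10)
alpha, beta = -v[0] / v[2], -v[1] / v[2]
row_gap = np.abs(F[2] - (alpha * F[0] + beta * F[1])).max()

# Orthographic case: a3 = (0, 0, 1) and v3 = 0 remove all terms that
# multiply x' or y' by x or y, leaving the linear form of Equation 8.
A_aff = A.copy()
A_aff[2] = [0.0, 0.0, 1.0]
v_aff = np.array([0.8, 0.2, 0.0])
F_aff = skew(v_aff) @ A_aff
```

In `F_aff` the upper-left 2 x 2 block vanishes, and the remaining entries of the first two rows are $v'_2$ and $-v'_1$, matching $\alpha_1$ and $\alpha_2$ of Equation 8.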

The algebraic function 7 can be used for re-projection 
onto a third view, by simply noting that the function be- 
tween view 1 and 3, and the function between view 2 and 
3, provide two equations for solving for (x",y"). This 
was proposed in the past, in various forms, by [20, 3, 19]. 
Since the algebraic function expresses the epipolar geom- 
etry between the two views, however, a solution can be 
found only if the COPs of the three cameras are non- 
collinear (cf. [28, 27]) — which can lead to numerical 
instability unless the COPs are far from collinear. The 
alternative, as shown next, is to derive algebraic
functions of three views directly. In that case, the
coordinates $(x'', y'')$ are solved for separately, each from a
single equation, without problems of singularities.
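
The two-view re-projection scheme discussed above (intersecting two epipolar lines in the third view) can be sketched as follows. This is an illustration under stated assumptions, not the method of [20, 3, 19]: the three cameras are written as hypothetical 3 x 4 matrices $[I, 0]$, $[A, -v']$ and $[B, -v'']$ acting on $(x, y, 1, k)$, and the fundamental matrices are built from them with the standard pseudo-inverse construction rather than estimated from correspondences.

```python
import numpy as np

def skew(v):
    """Matrix [v]_x such that skew(v) @ w equals np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def camera_center(P):
    """Right null vector of a 3x4 projection matrix (homogeneous COP)."""
    return np.linalg.svd(P)[2][-1]

def fundamental(P_i, P_j):
    """F with p_j^T F p_i = 0, built as [e_j]_x P_j pinv(P_i)."""
    e_j = P_j @ camera_center(P_i)
    return skew(e_j) @ P_j @ np.linalg.pinv(P_i)

def project(P, X):
    q = P @ X
    return q / q[2]

# Hypothetical camera transformations (illustrative values only).
A = np.array([[1.0, 0.1, 0.2], [-0.1, 1.0, 0.1], [0.2, -0.1, 1.0]])
B = np.array([[0.9, -0.2, 0.1], [0.2, 0.9, -0.1], [-0.1, 0.2, 1.0]])
v1 = np.array([0.8, 0.2, 0.1])     # plays the role of v'
v2 = np.array([0.3, -0.6, 0.2])    # plays the role of v''

P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([A, -v1[:, None]])
P3 = np.hstack([B, -v2[:, None]])

X = np.array([0.3, -0.2, 1.0, 0.7])           # (x, y, 1, k)
p_view1, p_view2, p_view3 = project(P1, X), project(P2, X), project(P3, X)

# Each known view contributes one epipolar line in the third view.
line_from_1 = fundamental(P1, P3) @ p_view1
line_from_2 = fundamental(P2, P3) @ p_view2
q = np.cross(line_from_1, line_from_2)        # intersection of the lines
p_view3_est = q / q[2]
```

When the three COPs approach collinearity the two lines become nearly identical and the cross product `q` becomes ill-conditioned, which is the instability noted in the text.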

6.2 Algebraic Functions of Three Views 

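Consider Equation 1 applied between view 1 and 2, and
between view 1 and 3:

$$\mu p' = Ap - kv', \qquad \nu p'' = Bp - kv''. \tag{9}$$

Here we make use of the result that affine-depth $k$ is
invariant for any view in reference to the first view. We
can isolate $k$ again from Equation 9 and obtain:

$$k = \frac{x'' b_3 \cdot p - b_1 \cdot p}{x'' v''_3 - v''_1} = \frac{y'' b_3 \cdot p - b_2 \cdot p}{y'' v''_3 - v''_2} = \frac{x'' b_2 \cdot p - y'' b_1 \cdot p}{x'' v''_2 - y'' v''_1}, \tag{10}$$

$^1$ This was a comment made by Tuan Luong.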



where $b_1, b_2, b_3$ are the row vectors of $B$ and
$v'' = (v''_1, v''_2, v''_3)$. Because of the invariance of $k$ we can
equate terms of Equation 5 with terms of Equation 10 and obtain
trilinear functions of image coordinates across three
views. For example, by equating the first term of Equation 5
with the first term of Equation 10, we obtain:

$$x''(v'_1 b_3 \cdot p - v''_3 a_1 \cdot p) + x''x'(v''_3 a_3 \cdot p - v'_3 b_3 \cdot p) + x'(v'_3 b_1 \cdot p - v''_1 a_3 \cdot p) + v''_1 a_1 \cdot p - v'_1 b_1 \cdot p = 0. \tag{11}$$

This leads to the following result: 

The image coordinates $(x, y)$, $(x', y')$ and $(x'', y'')$ of
three corresponding points across three perspective views
satisfy a trilinear equation of the following form:

$$x''(\alpha_1 x + \alpha_2 y + \alpha_3) + x''x'(\alpha_4 x + \alpha_5 y + \alpha_6) + x'(\alpha_7 x + \alpha_8 y + \alpha_9) + \alpha_{10} x + \alpha_{11} y + \alpha_{12} = 0, \tag{12}$$

where the coefficients aj, j = 1, ..., 12, have a fixed rela- 
tion to the camera transformations between the first view 
and the other two views. 

Note that the x coordinate in the third view, $x''$, is obtained
as a solution of a single equation in coordinates of
the other two views. The coefficients $\alpha_j$ can be recovered
as a solution of a linear system, directly if we observe 11
corresponding points across the three views (more than
11 points can be used for a least-squares solution), or
with fewer points by first recovering the elements of the
camera transformations as described in Section 3. Then, for
any additional point $(x, y)$ whose correspondence $(x', y')$ in the
second image is known, we can recover the corresponding
x coordinate, $x''$, in the third view by substitution
in Equation 12.
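
The linear recovery of the twelve coefficients of Equation 12 and the subsequent prediction of $x''$ can be sketched as below. This is a minimal illustration on synthetic, noise-free data: the camera transformations are hypothetical values used only to generate correspondences, and the function names are introduced here for the example. The coefficients are taken as the right singular vector of the design matrix with the smallest singular value, one common way of solving a homogeneous least-squares system.

```python
import numpy as np

# Hypothetical camera transformations used only to synthesize data.
A = np.array([[1.0, 0.1, 0.2], [-0.1, 1.0, 0.1], [0.2, -0.1, 1.0]])
B = np.array([[0.9, -0.2, 0.1], [0.2, 0.9, -0.1], [-0.1, 0.2, 1.0]])
v1 = np.array([0.8, 0.2, 0.1])     # plays the role of v'
v2 = np.array([0.3, -0.6, 0.2])    # plays the role of v''

def view(M, v, xy, k):
    """Image points of M p - k v for p = (x, y, 1), one row per point."""
    p = np.column_stack([xy, np.ones(len(xy))])
    q = p @ M.T - k[:, None] * v
    return q[:, :2] / q[:, 2:3]

def design(x, y, xp, xpp):
    """One row per point: the 12 monomials multiplying alpha_1..alpha_12."""
    lin = np.column_stack([x, y, np.ones_like(x)])
    return np.hstack([xpp[:, None] * lin,
                      (xpp * xp)[:, None] * lin,
                      xp[:, None] * lin,
                      lin])

def fit_trilinear(x, y, xp, xpp):
    """Coefficients (up to scale) as the smallest right singular vector."""
    return np.linalg.svd(design(x, y, xp, xpp))[2][-1]

def predict_xpp(alpha, x, y, xp):
    """Solve Equation 12 for x'' given (x, y) and x'."""
    lin = np.column_stack([x, y, np.ones_like(x)])
    num = xp * (lin @ alpha[6:9]) + lin @ alpha[9:12]
    den = lin @ alpha[0:3] + xp * (lin @ alpha[3:6])
    return -num / den

rng = np.random.default_rng(1)
xy = rng.uniform(-1.0, 1.0, size=(50, 2))
k = rng.uniform(-0.5, 0.5, size=50)         # affine depth of each point
xp = view(A, v1, xy, k)[:, 0]
xpp = view(B, v2, xy, k)[:, 0]

fit_idx, held_out = slice(0, 40), slice(40, 50)
alpha = fit_trilinear(xy[fit_idx, 0], xy[fit_idx, 1], xp[fit_idx], xpp[fit_idx])
xpp_pred = predict_xpp(alpha, xy[held_out, 0], xy[held_out, 1], xp[held_out])
max_err = np.abs(xpp_pred - xpp[held_out]).max()
```

Note that neither $k$ nor the camera matrices are used by `fit_trilinear` or `predict_xpp`; they appear only when generating the synthetic correspondences. With noisy measurements one would likely want to normalize coordinates before the SVD, which this sketch omits.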

In a similar fashion, after equating the first term of
Equation 5 with the second term of Equation 10, we
obtain an equation for $y''$ as a function of the two other
views:

$$y''(\beta_1 x + \beta_2 y + \beta_3) + y''x'(\beta_4 x + \beta_5 y + \beta_6) + x'(\beta_7 x + \beta_8 y + \beta_9) + \beta_{10} x + \beta_{11} y + \beta_{12} = 0. \tag{13}$$

Taken together, Equations 5 and 10 lead to 9 algebraic
functions of three views, six of which are separate for $x''$
and $y''$. Two of these six are Equations 12 and 13; the other
four are listed below:

$$x''(\cdot) + x''y'(\cdot) + y'(\cdot) + (\cdot) = 0, \tag{14}$$

$$y''(\cdot) + y''y'(\cdot) + y'(\cdot) + (\cdot) = 0, \tag{15}$$

$$x''x'(\cdot) + x''y'(\cdot) + x'(\cdot) + y'(\cdot) = 0, \tag{16}$$

$$y''x'(\cdot) + y''y'(\cdot) + x'(\cdot) + y'(\cdot) = 0, \tag{17}$$

where $(\cdot)$ represents linear polynomials in $x, y$. The
solution for $x'', y''$ is unique without constraints on the
allowed camera transformations. If we choose Equations 12
and 13, then $v'_1$ and $v'_3$ should not vanish simultaneously,
i.e., $v' = (0, 1, 0)$ is a singular case. Also
$v'' = (0, 1, 0)$ and $v'' = (1, 0, 0)$ give rise to singular cases.
One can easily show that for each singular case there 
are two other functions out of the nine available ones 
that provide a unique solution for x" , y" . Note that the 
singular cases are pointwise, i.e., only three epipolar di- 
rections are excluded, compared to the much stronger 
singular case when the algebraic function of two views is 
used separately, as described in the previous section. 

Taken together, the process of generating a novel view 
can be easily accomplished without the need to explicitly 



recover structure (affine depth), camera transformation 
(matrices A, B and epipoles v' , v") or epipolar geometry 
(just the epipoles or the Fundamental matrix) — for the 
price of using more than the minimal number of points that
are otherwise required (the minimum is six between the
two model views and the novel third view).

The connection between the general result of trilinear
functions of views and the "linear combination of views"
result [31] for orthographic views can easily be seen by
setting $A$ and $B$ to be affine in $\mathcal{P}^2$, and $v'_3 = v''_3 = 0$.
For example, Equation 11 reduces to:

$$v'_1 x'' - v''_1 x' + (v''_1 a_1 \cdot p - v'_1 b_1 \cdot p) = 0, \tag{18}$$

which is of the form:

$$\alpha_1 x'' + \alpha_2 x' + \alpha_3 x + \alpha_4 y + \alpha_5 = 0.$$

In the case where all three views are orthographic,
$x''$ is expressed as a linear combination of image coordinates
of the two other views, as discovered by [31].

In the next section we address another case, interme- 
diate between the general trilinear and the orthographic 
linear functions, which we find interesting for applica- 
tions of visual recognition. 

6.2.1 Recognition of Perspective views From 
an Orthographic Model 

Consider the case for which the two reference (model) 
views of an object are taken orthographically (using a 
tele lens would provide a reasonable approximation), but 
during recognition any perspective view of the object is 
allowed. It can easily be shown that the three views are
then connected via a bilinear function (instead of trilinear):
$A$ is affine in $\mathcal{P}^2$ and $v'_3 = 0$, therefore Equation 11
reduces to:

$$x''(v'_1 b_3 \cdot p - v''_3 a_1 \cdot p) + v''_3 x''x' - v''_1 x' + (v''_1 a_1 \cdot p - v'_1 b_1 \cdot p) = 0,$$

which is of the following form:

$$x''(\alpha_1 x + \alpha_2 y + \alpha_3) + \alpha_4 x''x' + \alpha_5 x' + \alpha_6 x + \alpha_7 y + \alpha_8 = 0. \tag{19}$$

Similarly, Equation 13 reduces to

$$y''(\beta_1 x + \beta_2 y + \beta_3) + \beta_4 y''x' + \beta_5 x' + \beta_6 x + \beta_7 y + \beta_8 = 0. \tag{20}$$



A bilinear function of three views has two advantages 
over the general trilinear function. First, only seven cor- 
responding points (instead of 11) across three views are 
required for solving for the coefficients (compared to the 
minimal six if we first recover A, B, v' , v"). Second, the 
lower the degree of the algebraic function, the less sen- 
sitive the solution should be in the presence of errors in 
measuring correspondences. In other words, it is likely 
(though not necessary) that the higher order terms, such 
as the term x"x'x in Equation 12, will have a higher con- 
tribution to the overall error sensitivity of the system. 

Compared to the case when all views are assumed or- 
thographic, this case is much less of an approximation. 
Since the model views are taken only once, it is not un- 
reasonable to require that they be taken in a special 



way, namely, with a tele lens (assuming we are dealing 
with object recognition, rather than scene recognition). 
If that requirement is satisfied, then the recognition task 
is general since we allow any perspective view to be taken 
during the recognition process. 

7 Applications 

Algebraic functions of views allow the manipulation of 
images of 3D objects without necessarily recovering 3D 
structure or any form of camera geometry (either full, or 
weak — the epipoles). 

The application that was emphasized throughout the 
paper is visual recognition via alignment. In this con- 
text, the general result of a trilinear relationship between 
views is not encouraging. If we want to avoid implicating 
structure and camera geometry, we must have 11 corre- 
sponding points across the three views — compared to 
six points, otherwise. In practice, however, we would 
need more than the minimal number of points in or- 
der to obtain a least squares solution. The question is 
whether the simplicity of the method using trilinear func- 
tions translates also to increased robustness in practice 
when many points are used — this is an open question. 

Still in the context of recognition, the existence of bi- 
linear functions in the special case where the model is 
orthographic, but the novel view is perspective, is more 
encouraging. Here we have the result that only seven cor- 
responding points are required to obtain recognition of 
perspective views (provided we can satisfy the require- 
ment that the model is orthographic) compared to six 
points when structure and camera geometry are recov- 
ered. The additional corresponding pair of points may 
be indeed worth the greater simplicity that comes with 
working with algebraic functions. 

There may exist other applications where simplicity 
is of major importance, whereas the number of points 
is less of a concern. Consider for example, the appli- 
cation of model-based compression. With the trilinear 
functions we need 22 parameters to represent a view as 
a function of two reference views in full correspondence. 
Assume both the sender and the receiver have the two
reference views and apply the same algorithm for obtaining
correspondences between the two views. To send
a third view (ignoring problems of self-occlusion, which
could be dealt with separately), the sender can solve for the
22 parameters using many points, but eventually send
only the 22 parameters. The receiver then simply combines
the two reference views in a "trilinear way" given
the received parameters. This is clearly a domain where
the number of points is not a major concern, whereas
simplicity, and probably robustness due to the short-cut
in the computations, is of great importance.

Related to image coding is a recent approach of image
decomposition into "layers" as proposed in [1, 2]. In this
approach, a sequence of views is divided up into regions,
the motion of each of which is described approximately by a
2D affine transformation. The sender sends the first image
followed only by the six affine parameters for each
region for each subsequent frame. The use of algebraic
functions of views can potentially make this approach
more powerful because, instead of dividing up the scene
into planes (the regions would correspond to planes if the
projection were parallel; in general they are not even planes),
one can attempt to divide the scene into objects, each
carrying the 22 parameters describing its displacement onto
the subsequent frame.

Another area of application may be in computer 
graphics. Re-projection techniques provide a short-cut 
for image rendering. Given two fully rendered views 
of some 3D object, other views (again ignoring self- 
occlusions) can be rendered by simply "combining" the 
reference views. Again, the number of corresponding 
points is less of a concern here. 

8 Summary of Part II 

The derivation of an affine invariant across perspective 
views in Section 3 was used to derive algebraic func- 
tions of image coordinates across two and three views. 
These enable the generation of novel views, for purposes 
of visual recognition and for other applications, without 
going through the process of recovering object structure 
(metric or non-metric) and camera geometry. 

Between two views there exists a unique function 
whose coefficients are the elements of the Fundamental 
matrix and were shown to be related explicitly to the 
camera transformation A,v': 

$$x'(\alpha_1 x + \alpha_2 y + \alpha_3) + y'(\alpha_4 x + \alpha_5 y + \alpha_6) + \alpha_7 x + \alpha_8 y + \alpha_9 = 0.$$

The derivation was also useful in making the connection 
to a similar expression, due to [10], made in the context 
of orthographic views. 

We have seen that trilinear functions of image coordinates
exist across three views, one of them shown below:

$$x''(\alpha_1 x + \alpha_2 y + \alpha_3) + x''x'(\alpha_4 x + \alpha_5 y + \alpha_6) + x'(\alpha_7 x + \alpha_8 y + \alpha_9) + \alpha_{10} x + \alpha_{11} y + \alpha_{12} = 0.$$

In case two of the views are orthographic, a bilinear
relationship across three views holds. For example, the
trilinear function above reduces to:

$$x''(\alpha_1 x + \alpha_2 y + \alpha_3) + \alpha_4 x''x' + \alpha_5 x' + \alpha_6 x + \alpha_7 y + \alpha_8 = 0.$$

In case all three views are orthographic, a linear
relationship holds, as observed in [31]:

$$\alpha_1 x'' + \alpha_2 x' + \alpha_3 x + \alpha_4 y + \alpha_5 = 0.$$

9 General Discussion 

For purposes of visual recognition by alignment, the
transformations induced by changing viewing positions
are at most affine. In other words, a pin-hole uncalibrated
camera is no more than an "affine engine" for tasks for
which a reference view (a model) is available. One of
the goals of this paper was to make this claim and make
use of it in providing methods for affine reconstruction
and for recognition.

An affine reconstruction follows immediately from 
Equation 1 and the realization that A is a collineation 
of some plane which is fixed for all views. The reconstructed
homogeneous coordinates are $(x, y, 1, k)$, where
$(x, y, 1)$ are the homogeneous coordinates of the image
plane of the reference view, and $k$ is an affine invariant.
The invariance of k can be used to generate novel views 
of the object (which are all affinely related to the refer- 
ence view), and thus achieve recognition via alignment. 
We can therefore distinguish between affine and non- 
affine transformations in the context of recognition: if 
the object is fixed and the transformations are induced 
by camera displacements, then k must be invariant — 
space of transformations is no more than affine. If, how- 
ever, the object is allowed to transform as well, then k 
would not remain fixed if the transformation is not affine, 
i.e. involves more than translation, rotation, scaling and 
shearing. For example, we may apply a projective transformation
in $\mathcal{P}^3$ to the object representation, i.e., map
five points (in general position) to arbitrary locations in 
space (which still remain in general position) and map 
all other points accordingly. This mapping allows more 
"distortions" than affine transformations allow, and can 
be detected by the fact that k will not remain fixed. 
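
The alignment scheme described above (recover $k$ against the reference view once, then reuse it to generate any other view) can be sketched in a few lines. The values of $A$, $B$ and the epipoles below are hypothetical, and the sketch assumes they have already been recovered as described in Section 3; it only illustrates how Equation 1 is used in both directions.

```python
import numpy as np

def affine_depth(p, p_prime, A, v_prime):
    """k in mu p' = A p - k v' (Equation 1).

    p' x (A p - k v') = 0 gives p' x A p = k (p' x v'); k is obtained by
    projecting one cross product onto the other.
    """
    ca = np.cross(p_prime, A @ p)
    cv = np.cross(p_prime, v_prime)
    return float(ca @ cv) / float(cv @ cv)

def reproject(p, k, B, v_dprime):
    """Novel view of the reconstructed point (x, y, 1, k): B p - k v''."""
    q = B @ p - k * v_dprime
    return q / q[2]

# Hypothetical camera transformations (illustrative values only).
A = np.array([[1.0, 0.1, 0.2], [-0.1, 1.0, 0.1], [0.2, -0.1, 1.0]])
B = np.array([[0.9, -0.2, 0.1], [0.2, 0.9, -0.1], [-0.1, 0.2, 1.0]])
v1 = np.array([0.8, 0.2, 0.1])     # plays the role of v'
v2 = np.array([0.3, -0.6, 0.2])    # plays the role of v''

p = np.array([0.3, -0.2, 1.0])     # point in the reference view
k_true = 0.7

q = A @ p - k_true * v1
p_prime = q / q[2]                 # observed in the second view
k_est = affine_depth(p, p_prime, A, v1)

reconstructed = np.append(p, k_est)            # (x, y, 1, k)
p_novel = reproject(p, k_est, B, v2)           # predicted third view
p_novel_true = reproject(p, k_true, B, v2)
```

If the object itself underwent a non-affine (projective) change between views, the value returned by `affine_depth` would differ from view to view, which is the detection criterion mentioned in the text.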

Another use of the affine derivations was expressed in 
Part II of this paper, by showing the existence of alge- 
braic functions of views. We have seen that any view 
can be expressed as a trilinear function with two refer- 
ence views in the general case, or as a bilinear function 
when the reference views are created by means of paral- 
lel projection. These functions provide alternative, much 
simpler, means for manipulating views of a scene. The 
camera geometries between one of the reference views 
and the other two views are folded into 22 coefficients. 
The number 22 is to be expected because these camera
geometries can be represented by two camera transformation
matrices, and we know that a camera transformation
matrix has 11 free parameters (a 3 x 4 matrix,
determined up to a scale factor). However, the folding
of the camera transformations is done in such a way
that we have two independent sets of 11 coefficients each,
and each set contains foldings of elements of both camera
transformation matrices (recall Equation 11). This
enables us to recover the coefficients from point corre- 
spondences alone, ignoring the 3D structure of the scene. 
Because of their simplicity, we believe that these alge- 
braic functions will find uses in tasks other than visual 
recognition — some of those are discussed in Section 7. 

This paper is also about projective invariants: when
we need to recover a projective invariant, what additional
advantages we should expect, and what price is involved
(more computations, more points, etc.). Before we discuss
those issues, it is worth
discussing a point or two related to the way affine-depth 
was derived. Results put aside, Equation 1 looks sus- 
piciously similar, or trivially derivable from, the classic 
motion equation between two frames. Also, there is the 
question of whether it was really necessary to use the 
tools of projective geometry for a result that is essen- 
tially affine. Finally, one may ask whether there are sim- 
pler derivations of the same result. Consider the classic 
motion equation for a calibrated camera: 

z'p' = zRp + t. 

Here $R$ is an orthogonal matrix accounting for the rotational
component of camera displacement, $t$ is the translation
component (note that $t \cong v'$), $z$ is the depth from
the first camera frame, and $z'$ is the depth value seen
from the second camera frame. Divide both sides of the
equation by $z$, assume that $R$ is an arbitrary non-singular
matrix $A$, and it seems that we have arrived at Equation 1,
where $k = -1/z$. In order to do it right, one
must start with an affine frame, map it affinely onto the 
first camera, then map it affinely onto the second cam- 
era, and then relate the two mappings together — it will 
then become clear that k is an invariant measurement. 
This derivation, which we will call an "affine derivation" , 
appears to have the advantage of not using projective ge- 
ometry. However, there are some critical pieces missing. 
First, and foremost, we have an equation but not an al- 
gorithm. We have seen that simple equation counting 
for solving for A and k, given t, from point correspon- 
dences is not sufficient, because the system of equations 
is singular for any number of corresponding points. Also, 
equation counting does not reveal the fact that only four 
points are necessary: three for A and the fourth for set- 
ting a mutual scale. Therefore, the realization that A is 
a homography of some plane that is fixed along all views 
— a fact that is not revealed by the affine derivation — 
is crucial for obtaining an algorithm. Second, the na- 
ture of the invariant measurement k is not completely 
revealed; it is not (inverse) depth because A is not nec- 
essarily orthogonal, and all the other results described 
in Section 3.2 do not clearly follow either. 
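
The informal step from the motion equation to the form of Equation 1 can be written out explicitly. This is only a restatement of the argument above, with $R$ relabeled as $A$ and $t$ as $v'$; it does not address the missing pieces just listed (the algorithm and the nature of $k$).

```latex
\begin{align*}
z' p' &= z R p + t \\
\frac{z'}{z}\, p' &= R p + \frac{1}{z}\, t \\
\mu\, p' &= A p - k v',
\qquad \mu = \frac{z'}{z},\quad A = R,\quad v' = t,\quad k = -\frac{1}{z}.
\end{align*}
```

Once $A$ is allowed to be an arbitrary non-singular matrix rather than a rotation, $k$ no longer equals inverse depth, which is the second objection raised above.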

Consider next the question of whether, within the context
of projective geometry, affine-depth could have been
derived on geometric grounds without setting up coordinates,
as we did. For example, although this was not
mentioned in Section 3, it is clear that the three points
$p', Ap, v'$ are collinear; this is well known and can be
derived from purely geometric considerations by observing
that the optical line $OP$ and the epipolar line $p'v'$
are projectively related in $\mathcal{P}^1$ (cf. [28, 29, 22]). It is less
obvious, however, to show on geometric grounds only 
that the ratio k is invariant independently of where the 
second view is located, because ratios are not generally 
preserved under projectivity (only cross-ratios are). In 
fact, as we saw, k is invariant but up to a uniform scale, 
therefore, for any particular optical line the ratio is not 
preserved. It is for this reason that algebra was intro- 
duced in Section 3 for the derivation of affine-depth. 

Consider next the difference between the affine and
the projective frameworks. We have seen that from a
theoretical standpoint, a projective invariant, such as
projective-depth $k$ in Equation 2, is really necessary
when a reference view is not available. For example,
assume we have a sequence of $n$ views
$\psi_0, \psi_1, ..., \psi_{n-1}$ of a scene and we wish to recover its
3D structure. An affine framework would result if we choose one
of the views, say $\psi_0$, as a reference view, and compute the
structure as seen from that camera location given the
correspondences $\psi_0 \Rightarrow \psi_i$ with all the remaining views;
this is a common approach for recovering metric structure from
a sequence. Because affine-depth is invariant, we have
$n - 1$ occurrences of the same measurement $k$ for every
point, which can be used as a source of information for
a least-squares solution for $k$ (or naively, simply average
the $n - 1$ measurements). Now consider the projective
framework. Projective-depth $k$ is invariant for any two
views $\psi_i, \psi_j$ of the sequence. We have therefore $n(n - 1)$
occurrences of $k$, which is clearly a stronger source of
information for obtaining an over-determined solution.
The conclusion from this example is that a projective 
framework has practical advantages over the affine, even 
in cases where an affine framework is theoretically suffi- 
cient. There are other practical considerations in favor 
of the projective framework. In the affine framework, the 
epipole v' plays a double role — first for computing the 
collineation A, and then for computing the affine-depth 
of all points of interest. In the projective framework, the 
epipoles are used only for computing the collineations A 
and E but not used for computing k. This difference 
has a practical value as one would probably like to have 
the epipoles play as little a role as possible because of 
the difficulty in recovering their location accurately in 
the presence of noise. In industrial applications, for ex- 
ample, one may be able to set up a frame of reference 
of two planes with four coplanar points on each of the 
planes. Then the collineations A and E can be com- 
puted without the need for the epipoles, and thus the 
entire algorithm, expressed in Equation 2, can proceed 
without recovering the epipoles at all.

Appendix



A Proof of Proposition 

Proposition 1 Given an arbitrary view $\psi \in S_w$ generated
by a camera with COP at initial position O, all
other views $\psi \in S_w$ can be generated by a rigid motion
of the camera frame from its initial position, if in addition
to taking pictures of the object we allow any finite
sequence of pictures of pictures to be taken as well.

Lemma 1 The set of views $S_w$ can be generated by a
rigid camera motion, starting from some fixed initial
position, followed by some collineation in $P^2$.

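Proof: We have shown that any view $\psi \in S_w$ can be
generated by satisfying Equation 1, reproduced below:

$$p' \cong Ap - kv'.$$

Note that k = 0 for all $P \in \pi$. First, we transform the
coordinate system to a camera-centered one by sending $\pi$ to
infinity. Let $M \in GL_4$ be defined as

$$M = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 \end{pmatrix}.$$

We have

$$p' \cong Ap - kv' = [A, -v'] \begin{pmatrix} x \\ y \\ 1 \\ k \end{pmatrix} = [A, -v'] M^{-1} M \begin{pmatrix} x \\ y \\ 1 \\ k \end{pmatrix} \cong [A, -v'] M^{-1} \begin{pmatrix} x_b \\ y_b \\ z_b \\ 1 \end{pmatrix} = S \begin{pmatrix} x_b \\ y_b \\ z_b \end{pmatrix} + u,$$

where $[S, u] = [A, -v'] M^{-1}$, $x_b = x/(x + y + 1 + k)$,
$y_b = y/(x + y + 1 + k)$ and $z_b = 1/(x + y + 1 + k)$.
Let R be a rotation matrix in 3D, i.e., $R \in GL_3$,
det(R) = 1, and let B denote a collineation in $P^2$,
i.e., $B \in GL_3$, and let w be some vector in 3D.
Then, we must show that

$$p' \cong BR \begin{pmatrix} x_b \\ y_b \\ z_b \end{pmatrix} + Bw.$$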
For every R, B and w, there exist S and u that produce
the same image, simply by setting S = BR and u = Bw.
We must also show that for every S and u there exist
R, B and w that produce the same image. Since S is of
full rank (because A is), the claim is true by simply
setting $B = SR^T$ and $w = B^{-1}u$, for any arbitrary
orthogonal matrix R. In conclusion, any view $\psi \in S_w$
can be generated by some rigid motion R, w starting
from a fixed initial position, followed by some collineation
B of the image plane. □
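
The two steps of this proof can be checked numerically. The sketch below is an illustration, not part of the original derivation. It assumes that S and u are the 3x3 and 3x1 blocks of $[A, -v'] M^{-1}$, and it uses random stand-ins for A and v' and an arbitrary rotation about the optical axis for R.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))   # stand-in for the collineation A
v = rng.normal(size=3)        # stand-in for the epipole v'
x, y, k = 0.3, -0.2, 0.7      # image point p = (x, y, 1) and its affine-depth k

M = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 1.0, 1.0]])

# Equation 1 written as a 3x4 matrix acting on (x, y, 1, k).
P = np.column_stack([A, -v])
p_prime = P @ np.array([x, y, 1.0, k])

# Change of coordinates: [S, u] = [A, -v'] M^{-1} (assumed block structure).
Su = P @ np.linalg.inv(M)
S, u = Su[:, :3], Su[:, 3]
d = x + y + 1.0 + k
xb = np.array([x, y, 1.0]) / d     # (x_b, y_b, z_b)
q = S @ xb + u                     # equals p_prime up to the scale factor d

# For any rotation R, B = S R^T and w = B^{-1} u reproduce the same image.
theta = 0.4
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
B = S @ R.T
w = np.linalg.solve(B, u)
r = B @ R @ xb + B @ w
```

Carrying out the product by hand gives $S = A + v'(1, 1, 1)$ and $u = -v'$ for this choice of M, which the script reproduces.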

We need to show next that any collineation in $P^2$
can be expressed by a finite sequence of views taken
by a rigidly moving camera, i.e., a calibrated camera. It
is worthwhile noting that the equivalence of projective
transformations (an algebraic concept) with a finite
sequence of projections of the plane onto itself (a geometric
concept) is fundamental in projective geometry. For
example, it is known that any projective transformation of
the plane can be obtained as the resultant of a finite
sequence of projections [32, Thm. 10, p. 74]. The
question, however, is whether the equivalence holds when
projections are restricted to what is generally allowed
in a rigidly moving camera model. In other words, in
a sequence of projections of the plane, we are allowed
to move the COP anywhere in $P^3$; the image plane is
allowed to rotate around the new location of the COP
and scale its distance from it along a distinguishable axis
(scaling focal length along the optical axis). What is not
allowed, for example, is tilting the image plane with
respect to the optical axis (that has the effect of changing
the location of the principal point and the image scale
factors, all of which should remain constant in a
calibrated camera). Without loss of generality, the camera
is set such that the optical axis is perpendicular to the
image plane, and therefore when the COP is an ideal
point the projecting rays are all perpendicular to the
plane, i.e., the case of orthographic projection.



The equivalence between a sequence of perspective/orthographic
views of a plane and projective transformations
of the plane is shown by first reducing the
problem to scaled orthographic projection by taking a
sequence of two perspective projections, and then using
a result of [30, 11] to show the equivalence for the scaled
orthographic case. The following two auxiliary propositions
are used:

Lemma 2 There is a unique projective transformation
of the plane in which a given line u is mapped onto
an ideal line (has no image in the real plane) and
which maps non-collinear points A, B, C onto given
non-collinear points A', B', C'.

Proof: This is standard material (cf. [7, p. 178]). □

Lemma 3 There is a scaled orthographic projection for 
any given affine transformation of the plane. 

Proof: This follows directly from [30, 11], which show that any
given affine transformation of the plane can be obtained
by a unique (up to a reflection) 3D similarity transform
of the plane followed by an orthographic projection. □
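
To make Lemma 3 concrete, the sketch below builds, for the linear part of a given affine map of the plane, a scaled rotation whose orthographic projection reproduces it. This is a minimal illustration under stated assumptions rather than the construction of [30, 11]: the 2x2 matrix is assumed nonsingular, the translation part is ignored because it is simply an image-plane shift, the closed form for the two unknown depth entries is worked out here, and the function name `similarity_from_affine` is hypothetical. Flipping the sign of both depth entries gives the reflected solution mentioned in the proof.

```python
import numpy as np

def similarity_from_affine(L):
    """Return (s, R) such that the top-left 2x2 block of s * R equals L.

    L: nonsingular 2x2 linear part of an affine map of the plane z = 0.
    R is a 3x3 rotation (det = +1) and s > 0 is a scale factor.
    """
    L = np.asarray(L, dtype=float)
    l1, l2 = L[:, 0], L[:, 1]
    w = l2 @ l2 - l1 @ l1          # required c1^2 - c2^2
    q = l1 @ l2                    # required -c1 * c2
    d = np.hypot(w, 2.0 * q)       # sqrt(w^2 + 4 q^2)
    c1 = np.sqrt(max((d + w) / 2.0, 0.0))
    c2 = -(1.0 if q >= 0 else -1.0) * np.sqrt(max((d - w) / 2.0, 0.0))
    # Lift the two columns of L to orthogonal 3D vectors of equal length s.
    e1 = np.array([l1[0], l1[1], c1])
    e2 = np.array([l2[0], l2[1], c2])
    s = np.linalg.norm(e1)
    e3 = np.cross(e1, e2) / s      # third column, also of length s
    sR = np.column_stack([e1, e2, e3])
    return s, sR / s

L_aff = np.array([[2.0, 0.5],
                  [-0.3, 1.2]])
s, R = similarity_from_affine(L_aff)

# Orthographic projection (drop z) of a rotated and scaled plane point.
pt = np.array([0.7, -1.1])
projected = (s * R @ np.array([pt[0], pt[1], 0.0]))[:2]
```

Here `projected` coincides with `L_aff @ pt`, so the affine map is realized by a similarity transform of the plane followed by an orthographic projection.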

Lemma 4 There is a finite sequence of perspective and
scaled orthographic views of the plane, taken by a
calibrated camera, for any given projective transformation
of the plane.

Proof: The proof follows and modifies [7, p. 179]. We
are given a plane $\alpha$ and a projective transformation T.
If T is affine, then by Lemma 3 the proposition is true.
If T is not affine, then there exists a line u in $\alpha$ that
is mapped onto an ideal line under T. Let A, B, C be
three non-collinear points which are not on u, and let
their images under T be A', B', C'. Take a perspective
view onto a plane $\alpha'$ such that u has no image in $\alpha'$ (the
plane $\alpha'$ is rotated around the new COP such that the
plane passing through the COP and u is parallel to $\alpha'$).
Let $A_1, B_1, C_1$ be the images of A, B, C in $\alpha'$. Project $\alpha'$
back to $\alpha$ by orthographic projection, and let $A_2, B_2, C_2$
be the images of $A_1, B_1, C_1$ in $\alpha$. Let F be the resultant
of these two projections in the stated order. Then F
is a projective transformation of $\alpha$ onto itself such that
u has no image (in the real plane) and A, B, C go into
$A_2, B_2, C_2$. From Lemma 3 there is a viewpoint and a
scaled orthographic projection of $\alpha$ onto $\alpha''$ such that
$A_2, B_2, C_2$ go into A', B', C', respectively. Let L be the
resultant of this projection (L is affine). Then $\hat{T} = FL$ is a
projective transformation of $\alpha$ such that u has no image
and A, B, C go into A', B', C'. By Lemma 2, $T = \hat{T}$
(projectively speaking, i.e., up to a scale factor). □
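
The algebraic skeleton of this argument, a projective map that sends u to the ideal line followed by an affine map, can be sketched numerically. The F built below is a simple algebraic substitute (an identity matrix whose last row is replaced by the line u), not the perspective-plus-orthographic construction of the proof. The function name `split_projective` is hypothetical, and the sketch assumes the bottom-right entry of T is nonzero so that this F is invertible. With column vectors, "F followed by L" is the matrix product `L @ F`.

```python
import numpy as np

def split_projective(T, eps=1e-12):
    """Factor a 3x3 projective map as T = L @ F, with L affine.

    F sends the line u (the line that T maps to the ideal line) to the
    ideal line; L has last row (0, 0, 1). Assumes T[2, 2] != 0.
    """
    T = np.asarray(T, dtype=float)
    u = T[2]                       # points p on u satisfy u @ p = 0
    if abs(u[2]) < eps:
        raise ValueError("this simple choice of F needs T[2, 2] != 0")
    F = np.eye(3)
    F[2] = u                       # third coordinate of F @ p is u @ p
    L = T @ np.linalg.inv(F)
    return F, L

T = np.array([[1.0, 0.2, 0.5],
              [0.1, 0.9, -0.3],
              [0.3, -0.2, 1.0]])
F, L = split_projective(T)

# A point on u: the cross product of u with any non-parallel vector.
p_on_u = np.cross(T[2], np.array([1.0, 0.0, 0.0]))
image_of_p = F @ p_on_u            # third coordinate is zero: an ideal point
```

Combining this factorization with a similarity realization of the affine factor L, as in Lemma 3, mirrors the structure of the proof, although the geometric realization of F by actual camera views is what the proof itself supplies.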

Proof of Proposition: This follows directly from
Lemma 1 and Lemma 4. □